Breast cancer is one of the most common cancers in women worldwide: roughly 12% of women are affected, and the number is still rising. If it is not identified at an early stage, it can be fatal. Doctors cannot manually screen every potential patient, which is where machine learning engineers and data scientists come in, bringing mathematical knowledge and computational power to the problem.
The dataset contains features extracted from the cells of breast cancer patients and of healthy people. As a machine learning engineer / data scientist, our task is to build an ML model that classifies tumors as malignant or benign. Because the labels are known, we will use supervised classification algorithms for this project.
Benign tumors aren't cancerous; malignant tumors are. A benign tumor grows in one place only and cannot spread to or invade other parts of the body.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
We load the breast cancer data using scikit-learn's load_breast_cancer function.
from sklearn.datasets import load_breast_cancer
cancer_dataset = load_breast_cancer()
cancer_dataset
type(cancer_dataset)
scikit-learn stores the data in a Bunch object, which behaves like a dictionary.
# Keys in dataset
cancer_dataset.keys()
cancer_dataset["target"]
cancer_dataset["target_names"]
A malignant tumor means the patient has breast cancer; benign tumors aren't cancer. In this dataset, malignant = 0 and benign = 1.
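To confirm which integer maps to which class, we can check the target names against the label counts; a quick sketch:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_dataset = load_breast_cancer()
# target_names[0] pairs with label 0, target_names[1] with label 1
print(list(cancer_dataset["target_names"]))

# bincount index 0 counts label 0 (malignant), index 1 counts label 1 (benign)
counts = np.bincount(cancer_dataset["target"])
print(counts)  # 212 malignant, 357 benign
```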
print(cancer_dataset["feature_names"])
Now we create a DataFrame by concatenating 'data' and 'target' and assigning the column names.
dataset = pd.DataFrame(np.c_[cancer_dataset["data"],cancer_dataset["target"]],
columns = np.append(cancer_dataset["feature_names"],["target"]))
dataset.head()
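The same DataFrame can equivalently be built with pd.concat, which avoids the np.c_ indirection; a sketch of that alternative:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_dataset = load_breast_cancer()
# Build the feature frame and the target column separately, then join them
features = pd.DataFrame(cancer_dataset["data"],
                        columns=cancer_dataset["feature_names"])
target = pd.Series(cancer_dataset["target"], name="target")
dataset = pd.concat([features, target], axis=1)
print(dataset.shape)  # 569 rows, 30 features + 1 target column
```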
dataset.shape
dataset.info()
dataset.isnull().sum()
dataset["target"].value_counts()
dataset.describe()
Pair plot of the breast cancer data. A pair plot shows pairwise scatter plots of numeric features, with each feature's distribution along the diagonal.
sns.pairplot(dataset,hue="target")
plt.show()
sns.pairplot(data=dataset,hue="target",vars=["mean radius","mean texture","mean perimeter","mean area","mean smoothness"])
plt.show()
sns.countplot(x="target",data=dataset)
plt.figure(figsize=(20,8))
sns.countplot(x="mean radius",hue="target",data=dataset)
plt.show()
plt.figure(figsize=(16,11))
sns.heatmap(dataset)
plt.show()
In the heatmap above we can see the spread of the different features' values. The values of 'mean area' and 'worst area' are greater than the others, while 'mean perimeter', 'area error', and 'worst perimeter' are slightly smaller but still greater than the remaining features.
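This can be verified numerically by sorting the column means; a quick check:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_dataset = load_breast_cancer()
dataset = pd.DataFrame(np.c_[cancer_dataset["data"], cancer_dataset["target"]],
                       columns=np.append(cancer_dataset["feature_names"], ["target"]))
# Columns with the largest means dominate the colour scale of the raw heatmap
top_means = dataset.drop("target", axis=1).mean().sort_values(ascending=False)
print(top_means.head(5))
```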
dataset.corr()
plt.figure(figsize=(20,20))
sns.heatmap(dataset.corr(),annot=True,cmap="coolwarm")
plt.show()
Taking the correlation of each feature with the target and visualizing it as a bar plot.
dataset2=dataset.drop("target",axis=1)
plt.figure(figsize = (16,8))
corr_with_target = dataset2.corrwith(dataset["target"])
result = sns.barplot(x=corr_with_target.index, y=corr_with_target)
result.tick_params(labelrotation = 90)
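The bar plot is easier to read once the correlations are sorted; a small sketch that lists the strongest negative and positive correlations with the target:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_dataset = load_breast_cancer()
dataset = pd.DataFrame(np.c_[cancer_dataset["data"], cancer_dataset["target"]],
                       columns=np.append(cancer_dataset["feature_names"], ["target"]))
# Sort correlations: most negative first (since benign = 1, larger tumor
# measurements correlate negatively with the target)
corr = dataset.drop("target", axis=1).corrwith(dataset["target"]).sort_values()
print(corr.head(3))  # strongest negative correlations
print(corr.tail(3))  # least negative / positive correlations
```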
Splitting the DataFrame into train and test sets.
x = dataset.drop("target",axis=1)
x.head(3)
y = dataset["target"]
y.head(3)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
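With an imbalanced class ratio (212 vs 357), it can help to pass stratify=y so both splits keep the same class proportions; a sketch of that optional variant:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
x, y = cancer["data"], cancer["target"]

# stratify=y preserves the malignant/benign ratio in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))  # near-identical ratios
```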
Standardizing the features so that values with different units and magnitudes are brought onto one common scale.
from sklearn.preprocessing import StandardScaler
scaling = StandardScaler()
x_train_scaling_data = scaling.fit_transform(x_train)
x_test_scaling_data = scaling.transform(x_test)
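After fitting, each training column should have mean ~0 and standard deviation ~1; a quick sanity check:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer["data"], cancer["target"], test_size=0.2, random_state=42)

scaling = StandardScaler()
# fit_transform learns mean/std from the training set only;
# the test set is transformed with those same statistics
x_train_scaled = scaling.fit_transform(x_train)
x_test_scaled = scaling.transform(x_test)

print(np.allclose(x_train_scaled.mean(axis=0), 0))  # True
print(np.allclose(x_train_scaled.std(axis=0), 1))   # True
```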
We now have clean data to build the ML model, but we still have to find which machine learning algorithm fits this data best. Since the output is in a categorical format, we will use supervised classification algorithms.
To find the best model, we train and test the dataset with multiple machine learning algorithms and compare the results. So let's try.
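One compact way to run such a comparison is a loop over candidate classifiers with cross-validation; a sketch (the model choices and hyperparameters here are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

cancer = load_breast_cancer()
X, y = cancer["data"], cancer["target"]

# Scale-sensitive models are wrapped in a pipeline with StandardScaler
models = {
    "LogisticRegression": make_pipeline(StandardScaler(),
                                        LogisticRegression(max_iter=1000)),
    "RandomForest": RandomForestClassifier(n_estimators=20, random_state=42),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```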
First, we need to import the required packages.
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# Trained without Standard Scaled Data
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression(max_iter=5000)  # raise max_iter so the solver converges on unscaled features
logistic_model.fit(x_train,y_train)
y_pred_logistic_model = logistic_model.predict(x_test)
accuracy_score(y_test,y_pred_logistic_model)
logistic_model.score(x_train,y_train)
logistic_model.score(x_test,y_test)
# Trained with Standard Scaled Data
logistic_model2 = LogisticRegression()
logistic_model2.fit(x_train_scaling_data,y_train)
y_pred_logistic_model2 = logistic_model2.predict(x_test_scaling_data)
accuracy_score(y_test,y_pred_logistic_model2)
logistic_model2.score(x_train_scaling_data,y_train)
logistic_model2.score(x_test_scaling_data,y_test)
y_pred = logistic_model.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
cm
plt.title('Heatmap of Confusion Matrix', fontsize = 15)
sns.heatmap(cm, annot = True)
plt.show()
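The accuracy score can also be read straight off the confusion matrix: the diagonal entries (true negatives and true positives) divided by the total. A small sketch verifying this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer["data"], cancer["target"], test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=5000).fit(x_train, y_train)
y_pred = model.predict(x_test)

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
# Correct predictions sit on the diagonal of the confusion matrix
accuracy = (tn + tp) / cm.sum()
print(accuracy, accuracy_score(y_test, y_pred))  # the two values match
```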
# Trained without Standard Scaled Data
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(n_estimators= 20,criterion= 'entropy',random_state=42)
random_forest_model.fit(x_train,y_train)
y_pred_random_forest_model = random_forest_model.predict(x_test)
accuracy_score(y_test,y_pred_random_forest_model)
random_forest_model.score(x_train,y_train)
random_forest_model.score(x_test,y_test)
# Trained with Standard Scaled Data
random_forest_model2 = RandomForestClassifier(n_estimators= 20,criterion='entropy',random_state=42)
random_forest_model2.fit(x_train_scaling_data,y_train)
y_pred_random_forest_model2 = random_forest_model2.predict(x_test_scaling_data)
accuracy_score(y_test,y_pred_random_forest_model2)
random_forest_model2.score(x_train_scaling_data,y_train)
random_forest_model2.score(x_test_scaling_data,y_test)
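A useful by-product of a random forest is its feature importances, which show which measurements drive the classification; a sketch (variable names are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer["data"], cancer["target"], test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=20, criterion="entropy",
                               random_state=42).fit(x_train, y_train)

# feature_importances_ sums to 1 across the 30 features
importances = pd.Series(model.feature_importances_,
                        index=cancer["feature_names"]).sort_values(ascending=False)
print(importances.head(5))
```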
prediction = random_forest_model.predict(x_test)
cm_random_forest = confusion_matrix(y_test,prediction)
cm_random_forest
y_pred_scaled = random_forest_model2.predict(x_test_scaling_data)
cm_scaled = confusion_matrix(y_test,y_pred_scaled)
cm_scaled
print(classification_report(y_test,prediction))
cm = confusion_matrix(y_test,prediction)
plt.title('Heatmap of Confusion Matrix', fontsize = 15)
sns.heatmap(cm, annot = True)
plt.show()
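Once a model is chosen, it can be saved to disk and reloaded later with joblib, so it does not have to be retrained for every prediction; a sketch (the file path is illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer["data"], cancer["target"], test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=20, criterion="entropy",
                               random_state=42).fit(x_train, y_train)

# Persist the trained model and load it back
path = os.path.join(tempfile.mkdtemp(), "breast_cancer_model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)

# The reloaded model makes identical predictions
print(np.array_equal(loaded.predict(x_test), model.predict(x_test)))  # True
```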